library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
c(dim(data_exp), dim(data_met), dim(data_mirna))
[1] 19940 173 5000 194 561 188
Figure Source: Rappoport, N., & Shamir, R. (2018). Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Research, 46(20), 10546–10562. https://doi.org/10.1093/nar/gky889
dim(bound_matrices)
[1] 25501 197
Matrix factorization techniques attempt to infer a set of latent variables from the data by finding factors of a data matrix. Principal Component Analysis (PCA) is a form of matrix factorization which finds factors based on the covariance structure of the data. Generally, matrix factorization methods may be formulated as:
X = WH,
where:

- X is the data matrix [M×N], with M the number of features (for example genes) and N the number of samples,
- W is an [M×K] factor matrix,
- H is the [K×N] latent variable coefficient matrix.
Tying this back to PCA, where X = UΣV^T, we may formulate the factorization in the same terms by setting W = UΣ and H = V^T. If K = rank(X), this factorization is lossless, i.e. X = WH. However, we normally seek a latent variable model with a considerably lower dimensionality than X (a low-rank representation of our data). In that case, we choose K < rank(X) and the factorization is lossy, i.e. X ≈ WH.
Thus, matrix factorization methods normally opt to minimize the error: min ∥X − WH∥.
The loss function we choose to minimize may further be subject to constraints or regularization terms. In the context of latent factor models, a regularization term might be added to the loss function, i.e. we might minimize ∥X − WH∥ + λ∥W∥^2 (so-called L2 regularization) instead of merely the reconstruction error. Adding such a term pushes the entries of W towards 0, in effect balancing better reconstruction of the data against a more parsimonious model. A more parsimonious latent factor model is one with more sparsity in the latent factors, which is desirable for model interpretation.
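To make the lossless-vs-lossy distinction concrete, here is a minimal sketch of a rank-K approximation via SVD on simulated data; the matrix X and the choice K = 5 are arbitrary, purely for illustration:

```r
# Rank-K approximation of a data matrix via SVD (simulated data)
set.seed(42)
X <- matrix(rnorm(100 * 20), nrow = 100)  # M = 100 features, N = 20 samples
K <- 5
s <- svd(X)
W <- s$u[, 1:K] %*% diag(s$d[1:K])  # M x K factor matrix (U * Sigma)
H <- t(s$v[, 1:K])                  # K x N coefficient matrix (V^T)
X.hat <- W %*% H                    # rank-K reconstruction, X ~ WH
# The Frobenius reconstruction error is positive for K < rank(X),
# and shrinks to ~0 as K approaches rank(X)
norm(X - X.hat, type = "F")
```

With K = rank(X) = 20 the same construction reproduces X up to numerical precision, matching the lossless case described above.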
Multiple Factor Analysis (MFA) is an example of an early integration method and a straightforward extension of PCA to multiple data types. The figure below illustrates a naive extension of PCA to a multi-omics context, where data matrices from different platforms are concatenated before applying PCA. PCA finds the linear combinations of the features which, when the data is projected onto them, preserve the most variance of any K-dimensional space. But because measurements from different experiments have different scales, they will also have variance (and covariance) at different scales. MFA addresses this issue and achieves balance among the data types by normalizing each data type X(i) separately: each feature matrix is divided by the first eigenvalue λ_1 of its principal component decomposition. After this normalization step, the matrices are "stacked" and passed on to PCA.
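The normalize-then-stack scheme can be sketched by hand (implementations such as FactoMineR's MFA perform this internally); x1 and x2 below are simulated stand-ins for two omics tables measured on the same 30 samples:

```r
# MFA-style normalization: divide each table by the first eigenvalue
# of its own PCA, then stack and run a single PCA (simulated data)
set.seed(1)
x1 <- matrix(rnorm(50 * 30), nrow = 50)              # omic 1: 50 features x 30 samples
x2 <- matrix(rnorm(200 * 30, sd = 10), nrow = 200)   # omic 2: larger scale

normalize.first.ev <- function(x) {
  ev1 <- prcomp(t(x))$sdev[1]^2  # first eigenvalue of the table's PCA
  x / ev1
}

x.stacked <- rbind(normalize.first.ev(x1), normalize.first.ev(x2))
r.pca <- prcomp(t(x.stacked))    # PCA on the stacked, normalized tables
```

Without the per-table normalization, the larger-scale x2 block would dominate the leading principal components of the stacked matrix.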
Let’s check how much of the variance in the stacked normalized data is explained by the PCA.
Since the first principal component explains only 7% of the variance, and the first five dimensions together explain only ~25% of the variance, we do not expect a PCA plot to separate the subtypes. But let's plot it and see.
fviz_mfa_ind(r.mfa, col.ind = subtypes_data$Subtype_Selected, geom="point", axes=c(1,2))
Warning in if (col.ind %in% c("cos2", "contrib", "coord")) partial = NULL :
the condition has length > 1 and only the first element will be used
fviz_mfa_ind(r.mfa, col.ind = subtypes_data$Subtype_Selected, geom="point", axes=c(2,3))
Warning in if (col.ind %in% c("cos2", "contrib", "coord")) partial = NULL :
the condition has length > 1 and only the first element will be used
So we do see separation of AML.3 from the other subtypes; even from this basic analysis, AML.3 appears to be a distinct subtype. PCA plots focus on the global structure of the data, but other methods allow you to balance the global structure against the local structure of the data. Let's see whether t-SNE, one such method, shows the same structure from a slightly different perspective.
mfa.h.tsne <- tsne(mfa.h)
Warning in if (class(X) == "dist") { :
the condition has length > 1 and only the first element will be used
sigma summary: Min. : 0.415658909962982 |1st Qu. : 0.538223861612916 |Median : 0.587855264486558 |Mean : 0.595665617862085 |3rd Qu. : 0.637941969661125 |Max. : 0.885911269418172 |
Epoch: Iteration #100 error is: 14.2728346224993
Epoch: Iteration #200 error is: 0.434572569562079
Epoch: Iteration #300 error is: 0.429238210078075
Epoch: Iteration #400 error is: 0.428805316301451
Epoch: Iteration #500 error is: 0.428767685951475
Epoch: Iteration #600 error is: 0.428762922328892
Epoch: Iteration #700 error is: 0.428762275791519
Epoch: Iteration #800 error is: 0.428762181982562
Epoch: Iteration #900 error is: 0.428762165484551
Epoch: Iteration #1000 error is: 0.428762162690721
mfa.h.tsne.tbl <- as_tibble(mfa.h.tsne) %>%
rename(tSNE_1=V1, tSNE_2=V2) %>%
mutate(sample_codes = rownames(mfa.h))
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
p1 <- mfa.h.tsne.tbl %>%
left_join(subtypes_data, by = "sample_codes") %>%
ggplot() + geom_point(aes(x=tSNE_1, y=tSNE_2, color=as.factor(Subtype_miRNA), size=1, alpha=0.5)) +
labs(title="tSNE for MFA [miRNA subtypes]", color="miRNA subtype") + guides(alpha=FALSE, size=FALSE)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
p2 <- mfa.h.tsne.tbl %>%
left_join(subtypes_data, by = "sample_codes") %>%
ggplot() + geom_point(aes(x=tSNE_1, y=tSNE_2, color=as.factor(Subtype_Selected), size=1, alpha=0.5)) +
labs(title="tSNE for MFA [mRNA subtypes]", color="mRNA subtype") + guides(alpha=FALSE, size=FALSE)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
ggarrange(p1, p2, ncol = 1)
Again, we see that AML.3 is well separated, and the separation between AML.3 and the other subtypes is clearer in this plot.
mfa.h.tsne.tbl %>%
left_join(subtypes_data, by = "sample_codes") %>%
select(tSNE_1, tSNE_2, exp, met, mirna) %>%
gather(omic, sample.available, -tSNE_1, -tSNE_2) %>%
ggplot() +
geom_point(aes(x=tSNE_1, y=tSNE_2, color=sample.available, size=1, alpha=0.1)) +
facet_wrap(omic~., ncol=2) +
labs(title="tSNE for MFA [sample availability]", color="sample available in omic") +
guides(alpha=FALSE, size=FALSE)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
Another way to examine the MFA factors, which does not require additional data transformation, is a heatmap.
A common analysis in biological investigations is clustering. This is often interesting in cancer studies as one hopes to find groups of tumors (clusters) which behave similarly, i.e. have similar risks and/or respond to the same drugs. PCA is a common step in clustering analyses, and so it is easy to see how the latent variable models above may all be a useful pre-processing step before clustering.
Here, we use the K-means clustering method to investigate clusters in our MFA results. This method divides or partitions the data points, in our working example patients, into a pre-determined number of clusters, K. How do we know which K to choose? This is a very difficult question. When visualized, the results are not always easy to assess (especially if we cannot plot them directly): should we consider a large cluster as one cluster, or should we consider its sub-clusters as individual clusters? There are metrics to help, but there is no definite answer, and for biological datasets there is no "ground truth" to compare against. What remains is the careful assessment of a range of possible K values using different clustering metrics. For example, we may check cluster stability (the resistance of clusters to data perturbation) or assess the similarity between samples assigned to the same cluster (cluster silhouettes).
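As a small illustration of one such metric, here is a hedged sketch of scanning K by average silhouette width with the cluster package; the matrix h is simulated with two well-separated groups rather than taken from the MFA results:

```r
# Scan candidate K values by average silhouette width (cluster package)
library(cluster)
set.seed(7)
h <- rbind(matrix(rnorm(50 * 3, mean = 0), ncol = 3),
           matrix(rnorm(50 * 3, mean = 4), ncol = 3))  # two simulated groups
d <- dist(h)
sil.width <- sapply(2:6, function(k) {
  km <- kmeans(h, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])       # average silhouette
})
which.max(sil.width) + 1  # K with the best average silhouette (likely 2 here)
```

On real MFA factors the curve is rarely this clean, which is why the tutorial inspects several K values side by side below.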
Moreover, it is important to keep in mind the shortcomings of the chosen model and of the biological data. Here, the clusters found (using K=7) do not correspond to the most commonly used subtypes per the TCGA metadata. However, we know that the subtype assignments were created using only the gene expression data, so this is not unexpected. We will revisit this subtype-cluster comparison in a later section of this tutorial.
mfa.clusters %>%
as_tibble() %>%
mutate(sample_codes=rownames(mfa.clusters)) %>%
left_join(mfa.h.tsne.tbl, by = "sample_codes") %>%
gather(kmeans, label, -sample_codes, -tSNE_1, -tSNE_2) %>%
mutate(label=as.factor(label)) %>%
ggplot() +
geom_point(aes(x=tSNE_1, y=tSNE_2, color=label, alpha=0.1)) +
facet_wrap(kmeans~., ncol=3) +
labs(title="tSNE of MFA [K-means]", color="label") +
guides(alpha=FALSE, size=FALSE, color=FALSE)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
(Optional) exercise:
Hint: The criteria used for the PCA analysis to select the number of components to keep would apply here. Start with investigating the “r.mfa$eig” matrix.
iCluster takes a Bayesian approach to the latent variable model. In Bayesian statistics, we infer distributions over model parameters, rather than finding a single maximum-likelihood parameter estimate. In iCluster, we model the data as:
X(i) = W(i)Z + ϵ_i,
where:

- X(i) is a data matrix from a single omics platform,
- W(i) are model parameters,
- Z is a latent variable matrix shared among the omics platforms,
- ϵ_i is a normally distributed noise matrix, ϵ_i ∼ N(0,Ψ), with Ψ = diag(ψ_1,…,ψ_M) a diagonal covariance matrix.
With this construction, the omics measurements X(i) are expected to be similar for samples with the same latent variable representation up to Gaussian noise. iCluster maximizes the likelihood of the observed data with an additional Lasso-regularization to impose sparsity on W(i) matrices. Optimization is performed using an EM-like algorithm, briefly:
Finally, K-means is run on the lower-dimensional representation of the final Z to obtain the cluster assignments.
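The generative model itself is easy to simulate; here is a minimal sketch for two hypothetical platforms sharing one latent matrix Z (all dimensions and noise levels are arbitrary choices for illustration):

```r
# Simulate the iCluster generative model X(i) = W(i) Z + eps_i
set.seed(3)
n <- 60; k <- 2                         # samples, latent dimensions
Z  <- matrix(rnorm(k * n), nrow = k)    # shared latent variables (k x n)
W1 <- matrix(rnorm(100 * k), ncol = k)  # platform 1 loadings (100 features)
W2 <- matrix(rnorm(300 * k), ncol = k)  # platform 2 loadings (300 features)
X1 <- W1 %*% Z + matrix(rnorm(100 * n, sd = 0.5), nrow = 100)  # omic 1 + noise
X2 <- W2 %*% Z + matrix(rnorm(300 * n, sd = 0.5), nrow = 300)  # omic 2 + noise
# Both data matrices carry the same Z, which the inference tries to recover
```

Samples with similar columns of Z produce similar measurements in both X1 and X2 up to the Gaussian noise, which is exactly the structure described above.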
iCluster+ is an extension of the iCluster framework which allows omics types to arise from distributions other than the Gaussian. While a normal distribution is a good assumption for log-transformed, centered gene expression data, it is a poor model for binary mutation data or copy number variation data. iCluster+ allows the different X(i) to have different distributions:

- for binary mutation data, X is drawn from a multivariate binomial,
- for normal, continuous data, X is drawn from a multivariate Gaussian,
- for copy number variation data, X is drawn from a multinomial,
- for count data, X is drawn from a Poisson.
iCluster+ fits a regularized latent variable model based clustering that generates an integrated cluster assignment based on joint Bayesian inference across data types.
iCluster+ remains one of the most popular multi-omic integration methods; however, it is important to note that it supports only the integration of data matrices with an equal number of samples and no missing values. Here, we subset our dataset and use only the samples common to all data types. Moreover, the execution time of iCluster+ is definitely worth consideration: for datasets with a high number of features, an additional feature selection step may be required.
How do we decide on model parameters? Usually, we would use parallel computing (and the "tune.iClusterPlus" function) to search the parameter space for the best model, i.e. the minimum Bayesian information criterion (BIC) value. When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. BIC attempts to resolve this problem by introducing a penalty term for the number of parameters in the model. The number of points to sample (n.lambda) depends on the number of data types; see the iCluster+ user manual for more details. This step is very time consuming, so we omit it in this tutorial.
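To see the BIC trade-off in miniature, here is a toy example with nested linear models on simulated data (unrelated to iCluster+ itself, purely illustrative of the penalty BIC = -2 log L + p log n):

```r
# BIC penalizes extra parameters: the overparameterized model loses
# despite fitting the training data at least as well
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)       # true relationship is linear
m1 <- lm(y ~ x)             # correctly specified model
m2 <- lm(y ~ poly(x, 8))    # overparameterized model
BIC(m1) < BIC(m2)           # the penalty favors the smaller model here
```

The same logic drives the tune.iClusterPlus search: among candidate (K, lambda) settings, the fit with the lowest BIC balances likelihood against model complexity.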
In MFA, Dim.1 is the factor with the most variance; if we do not know whether this holds for another method's factors, it is a good idea to look closer at the factor matrix. Here is a quick way to plot the various pairs of factors:
as_tibble(icluster.z, rownames="sample_codes") %>%
left_join(subtypes_data, by = "sample_codes") %>%
select(sample_codes,V1,V2,V3,V4,V5,V6,Subtype_Selected) %>%
na.omit() %>%
ggpairs(columns=2:7, ggplot2::aes(col=Subtype_Selected))
# plot specific pair of factors
as_tibble(icluster.z, rownames="sample_codes") %>%
left_join(subtypes_data, by = "sample_codes") %>%
select(sample_codes,V1,V2,V3,V4,V5,V6,Subtype_Selected) %>%
ggplot(aes(V5,V6,col=Subtype_Selected)) + geom_point()
icp.tsne <- tsne(icluster.z)
Warning in if (class(X) == "dist") { :
the condition has length > 1 and only the first element will be used
sigma summary: Min. : 0.486238809378115 |1st Qu. : 0.556784386115142 |Median : 0.591771892526059 |Mean : 0.597348301718699 |3rd Qu. : 0.631383122293027 |Max. : 0.722144582891478 |
Epoch: Iteration #100 error is: 14.8948716818441
Epoch: Iteration #200 error is: 0.796895352752098
Epoch: Iteration #300 error is: 0.710924065880571
Epoch: Iteration #400 error is: 0.69540961995969
Epoch: Iteration #500 error is: 0.695104334630388
Epoch: Iteration #600 error is: 0.695103687742692
Epoch: Iteration #700 error is: 0.695103681173074
Epoch: Iteration #800 error is: 0.695103681108171
Epoch: Iteration #900 error is: 0.695103681107535
Epoch: Iteration #1000 error is: 0.695103681107529
icp.tsne.tbl <- as_tibble(icp.tsne) %>%
rename(tSNE_1=V1, tSNE_2=V2) %>%
mutate(sample_codes = colnames(exp.common))
p1 <- icp.tsne.tbl %>%
left_join(subtypes_data, by = "sample_codes") %>%
ggplot() + geom_point(aes(x=tSNE_1, y=tSNE_2, color=as.factor(Subtype_miRNA), size=1, alpha=0.5)) +
labs(title="tSNE for iCluster+ [miRNA subtypes]", color="miRNA subtype") + guides(alpha=FALSE, size=FALSE)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
p2 <- icp.tsne.tbl %>%
left_join(subtypes_data, by = "sample_codes") %>%
ggplot() + geom_point(aes(x=tSNE_1, y=tSNE_2, color=as.factor(Subtype_Selected), size=1, alpha=0.5)) +
labs(title="tSNE for iCluster+ [mRNA subtypes]", color="mRNA subtype") + guides(alpha=FALSE, size=FALSE)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
p1
p2
icp.tsne.tbl %>%
left_join(subtypes_data, by = "sample_codes") %>%
mutate(icp_label=r.icluster$clusters) %>%
ggplot() + geom_point(aes(x=tSNE_1, y=tSNE_2, color=as.factor(icp_label), size=1, alpha=0.5)) +
labs(title="tSNE for iCluster+", color="iCluster+ label") + guides(alpha=FALSE, size=FALSE)
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
PINS is an example of a late integration multi-omic framework. It works by integrating co-clustering patterns across the separate omics; the patterns are based on assessing the robustness of clusters to perturbation of the original data.
PINS clusters each data type separately and creates a connectivity matrix S_i (n × n), where n is the number of samples, with S_i(x,y) = 1 if the two samples are clustered together and 0 otherwise. The original data is then perturbed multiple times by adding Gaussian noise, and the resulting connectivity matrices are used to build a similarity matrix by averaging the pairwise connectivity between samples. In the ideal case, the similarity matrix defines groups of patients that are strongly connected across all data types. However, as we observed previously, in practice this may not hold, in which case the average connectivity value is used to create the similarity matrix. Finally, the similarity matrix is partitioned to find the sample labels. To assess whether the data follows a hierarchical structure, the connectivity assessment is repeated separately for the sample groups identified in the first stage, so that samples can be split into subgroups where possible.
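The first step can be sketched directly; here labels comes from a single K-means run on simulated data (an assumption for illustration, not PINS internals):

```r
# Connectivity matrix from one clustering:
# S[x, y] = 1 if samples x and y share a cluster label, else 0
set.seed(9)
labels <- kmeans(matrix(rnorm(20 * 2), ncol = 2), centers = 3)$cluster
S <- outer(labels, labels, FUN = "==") * 1  # n x n binary connectivity matrix
# Averaging such matrices over many perturbed datasets yields the
# similarity matrix that PINS partitions into sample groups
diag(S)  # every sample is always connected to itself
```

Entries that stay near 1 after averaging over perturbations mark sample pairs whose co-clustering is robust to noise.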
PINS+ is an extension of PINS that greatly improves the performance of the method. PINS+ requires each data type to have the same number of samples and no missing values in the feature matrices.
pins.ret = SubtypingOmicsData(dataList=list(t(exp.common), t(met.common), t(mirna.common)),
kMax = 9,
ncore = 2) #number of cores for parallel computing
Clustering method: kmeans
Perturbation method: noise
Building original connectivity matrices
Building perturbed connectivity matrices
Done in 40.902055978775 secs.
Clustering method: kmeans
Perturbation method: noise
Building original connectivity matrices
Building perturbed connectivity matrices
Done in 11.2795128822327 secs.
Clustering method: kmeans
Perturbation method: noise
Building original connectivity matrices
Building perturbed connectivity matrices
Done in 26.515655040741 secs.
STAGE : 1 Agreement : 0.214549251653324
Check if can proceed to stage II
Done in 1.33525195121765 mins.
clustering2 = pins.ret$cluster2
clustering1 = pins.ret$cluster1
Since PINS+ provides only sample labels and statistics/metrics calculated for the specific data types (such as the perturbed and unperturbed connectivity matrices), as opposed to integrative matrices for the multi-omic dataset, we will investigate how the clustering labels separate samples in the individual datasets.
exp.pca <- prcomp(t(exp.common))
mirna.pca <- prcomp(t(mirna.common))
met.pca <- prcomp(t(met.common))
fviz_pca_ind(exp.pca, col.ind = factor(clustering2), geom="point", axes=c(1,2)) +
scale_shape_manual(values=rep(19, length(unique(clustering2)))) +
labs(title="Gene expression PCA [PINS+ labels]")
fviz_pca_ind(mirna.pca, col.ind = factor(clustering2), geom="point", axes=c(1,2)) +
scale_shape_manual(values=rep(19, length(unique(clustering2)))) +
labs(title="MicroRNA expression PCA [PINS+ labels]")
fviz_pca_ind(met.pca, col.ind = factor(clustering2), geom="point", axes=c(1,2)) +
scale_shape_manual(values=rep(19, length(unique(clustering2)))) +
labs(title="DNA methylation PCA [PINS+ labels]")
SUMO is a multi-omic integration method that combines omic-specific, similarity-driven approaches with a joint dimensionality reduction framework.
First, each omic is inspected separately and the pairwise distances between samples are calculated. By default the distance metric is the Euclidean distance, which is appropriate for continuous data types such as the normalized count data and DNA methylation data we use in this tutorial. The calculated distances are then used to create the data-type-specific similarity matrices A_k by applying the following radial basis function:
A(i,j) = exp( -ρ^2(i,j) / με(i)ε(j) ),
where: - A(i,j) is the similarity between samples “i” and “j” in a given data type, - ρ(i,j) is the calculated distance between samples “i” and “j”, - μ is the hyperparameter of the kernel (by default 0.5), - ε(i) is the average distance between sample “i” and its “K” nearest neighbors (by default “K” is set to 10% of the samples in the data type; however, if the number of clusters is known, it should be set to #samples / #clusters).
Depending on the data type used, a different distance metric may be appropriate (for example, cosine similarity was found to better represent the similarities between cells in scATAC-seq). Alternative methods to create the similarity matrices currently implemented in SUMO include cosine similarity and Pearson and Spearman correlation. If needed, this framework can be extended with a custom similarity metric that follows the constraints A_k(i,j) ∈ \[0,1\] and A_k(i,i) = 1, where “k” is used as an index for the data type.
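To make this concrete, here is a minimal R sketch of building one omic-specific similarity matrix from a feature matrix. It mirrors the formula above (with the default μ = 0.5 and “K” set to 10% of the samples), but it is our illustration, not SUMO’s implementation:

```r
# Sketch of the radial basis similarity described above (not SUMO's code).
# X: features x samples matrix; mu: kernel hyperparameter;
# k.nn: number of nearest neighbors used for the local scaling factors eps(i).
make_similarity <- function(X, mu = 0.5, k.nn = max(1, round(0.1 * ncol(X)))) {
  D <- as.matrix(dist(t(X)))          # pairwise Euclidean distances rho(i, j)
  # eps(i): average distance between sample i and its k.nn nearest neighbors
  eps <- apply(D, 1, function(d) mean(sort(d)[2:(k.nn + 1)]))  # sort(d)[1] is the self-distance 0
  A <- exp(-D^2 / (mu * outer(eps, eps)))
  diag(A) <- 1                        # rho(i, i) = 0, so A(i, i) = 1
  A
}

set.seed(42)
X <- matrix(rnorm(50 * 20), nrow = 50)  # toy data: 50 features, 20 samples
A <- make_similarity(X)
```

Note how the local scaling by ε(i)ε(j) adapts the kernel bandwidth to the density around each pair of samples.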
After the creation of the omic-specific similarity matrices, all of them are factorized jointly by a variant of symmetric non-negative matrix factorization (NMF). Symmetric NMF is a modification of the general matrix factorization formulation X ≈ WH (which we saw in the “MFA” section of this tutorial), where the decomposition is performed on a symmetric matrix of non-negative values to improve the clustering quality. As the pairwise sample similarity matrices meet these requirements, symmetric NMF can be formulated as: A ≈ HH^T.
To integrate the separate similarity matrices SUMO further extends this formulation to:
A(i) ≈ H S(i) H^T,
where: - H (n x k) is a non-negative decomposition shared between the data types and represents the “n” samples in a “k”-dimensional space that accounts for the similarities in all the omics, - S(i) (k x k) is a data type specific non-negative matrix accounting for the relationships between clusters of samples in set omic, - “i” here is used as an index for the data type, - “k” is the factorization rank / the desired number of clusters.
The above factorization is computed by minimizing the following objective function:
min over H ≥ 0, S(i) ≥ 0 of: Σ_i ∥ W(i) ∘ (A(i) − H S(i) H^T) ∥_F² + λ∥H∥₁,
where ∘ denotes the element-wise product.
As discussed in the “MFA” section, this cost function includes a regularization term that imposes sparsity on the H matrix. Moreover, the introduced W(i) matrices (data-type-specific binary indicator matrices) are used to “mask” values in the similarity matrices when the distance between a pair of samples could not be calculated reliably, i.e. when a sample is missing from one of the data types, or when many missing feature values lower the accuracy of the distance calculation. Thus, SUMO can handle both missing samples and missing feature values, with no need for data imputation and no additional feature selection.
The minimization of the above objective function is a general optimization problem that can be solved by applying multiplicative update rules.
The sample labels are recovered from the H matrix by applying the maximum value criterion (finding which column/cluster has the maximum value for each sample).
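As a toy illustration of the tri-factorization and the maximum value criterion, the sketch below factorizes a single block-structured similarity matrix with plain multiplicative updates. This is a deliberate simplification — one omic, no masking matrices W(i), no sparsity penalty — and not SUMO’s optimizer:

```r
# Toy symmetric tri-factorization A ~ H S t(H) (single omic, no masking or
# sparsity term -- a simplification of the SUMO objective described above).
set.seed(1)
n <- 30; k <- 3
grp <- rep(1:k, each = n / k)
A <- 0.1 + 0.8 * outer(grp, grp, "==")   # block-structured toy similarity matrix
diag(A) <- 1

H <- matrix(runif(n * k), n, k)          # non-negative random initialization
S <- diag(k)
err <- function(A, H, S) sum((A - H %*% S %*% t(H))^2)
e0 <- err(A, H, S)
for (iter in 1:200) {                    # standard multiplicative updates
  H <- H * (A %*% H %*% S) / (H %*% S %*% t(H) %*% H %*% S + 1e-10)
  S <- S * (t(H) %*% A %*% H) / (t(H) %*% H %*% S %*% t(H) %*% H + 1e-10)
}
e1 <- err(A, H, S)
labels <- max.col(H)                     # maximum value criterion
```

With a clean block structure like this, the recovered labels align with the planted groups; on real data the consensus machinery described below is needed.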
It is important to note that this cost function is not jointly convex, so following the above steps does not guarantee reaching the global minimum, and the result is sensitive to the initial conditions. To deal with this problem, SUMO implements a series of improvements: - the initial values of H and S(i) are not fully random, but partially set using an SVD-based approach according to the average similarity across the data types; - the factorization is run multiple times with differently initialized matrices; - the sample labels recovered from each factorization run are used to create a consensus matrix (weighted by the residual error of each run), and the final sample labels are recovered from the consensus matrix using the Normalized Cut algorithm; - finally, to improve the robustness of the resulting clusters, SUMO implements a subsampling approach, where each factorization is run on a subset of samples.
SUMO is a command-line package written in Python. The data pre-processing includes the following steps: - (Optional) data filtering - It is recommended to remove non-informative features (which we did during data exploration) and features with a large fraction of missing values to speed up the computation. An example of a different filtering criterion is the removal of genes that are not protein coding. - Data transformation - This step depends on the data type used. For the count data we already performed the log transformation, which is sufficient. For the methylation data (as recommended in the SUMO documentation) we will transform beta-values into M-values, i.e. the log2 ratio of methylated to unmethylated counts. - Data normalization - Here, we perform feature-wise (z-score) standardization.
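The transformation and normalization steps can be sketched in R as follows (the helper names are ours, not part of SUMO):

```r
# Sketch of the beta-value -> M-value transformation and feature-wise
# standardization described above.
beta_to_m <- function(beta, eps = 1e-6) {
  beta <- pmin(pmax(beta, eps), 1 - eps)   # guard against log2(0) and division by zero
  log2(beta / (1 - beta))                  # M-value: log2 methylated/unmethylated ratio
}
standardize <- function(X) t(scale(t(X))) # z-score each feature (row) across samples

set.seed(11)
beta <- matrix(runif(20, 0.01, 0.99), nrow = 4)  # toy beta-values: 4 features x 5 samples
M <- standardize(beta_to_m(beta))
```

A beta-value of 0.5 (equal methylated and unmethylated signal) maps to an M-value of 0, and the clamping with `eps` keeps fully (un)methylated probes finite.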
REMEMBER: Run all SUMO commands listed below using command-line from the main directory of this repository.
Here is how you can run SUMO in prepare mode to generate similarity matrices:
sumo prepare -plot aml/sumo_prepare aml/exp_sumo.tsv,aml/met_sumo.tsv,aml/mirna_sumo.tsv aml/prepared.aml.npz
SUMO created an .npz file containing the multiplex similarity network and plotted the similarity matrices. Let’s inspect them more closely in R. Can we see groups of samples separating together in the different data types?
# how to read .npz files in R with reticulate
np <- import("numpy")
npz <- np$load("./data/AML/prepared.aml.npz", allow_pickle=TRUE)
npz$files
[1] "0" "f0" "1" "f1" "2" "f2" "samples"
exp_sim <- npz$f[['0']] # similarity matrices are stored under indices following the input file order, so '0' should be gene expression
head(exp_sim) # rows full of NaN values indicate samples missing from this data type
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] NaN NaN NaN NaN NaN NaN NaN NaN NaN
[2,] NaN 1.00000000 NaN 0.02929302 0.03855912 0.03231121 0.05847803 NaN 0.06743043
[3,] NaN NaN NaN NaN NaN NaN NaN NaN NaN
[4,] NaN 0.02929302 NaN 1.00000000 0.05705135 0.05500248 0.09151551 NaN 0.01213601
[5,] NaN 0.03855912 NaN 0.05705135 1.00000000 0.04514824 0.12572737 NaN 0.01291242
 [ columns 10-197 omitted for brevity ]
[ reached getOption("max.print") -- omitted 1 row ]
draw(h, merge_legend=TRUE)
rownames(mirna_sim) <- sample_order
colnames(mirna_sim) <- sample_order
# remove missing samples from the matrix for the visualization
avail_samples <- colnames(mirna_sim)[rowSums(is.na(mirna_sim)) != dim(mirna_sim)[1]]
mirna_subtypes <- tibble(sample_codes=avail_samples) %>%
left_join(subtypes_data, by = "sample_codes") # in order of avail_samples
h <- Heatmap(mirna_sim[avail_samples, avail_samples], show_row_names=FALSE, show_column_names=FALSE,
name="miRNA expression SUMO similarity",
top_annotation=HeatmapAnnotation(subtype_miRNA=mirna_subtypes$Subtype_miRNA,
col = list(subtype_miRNA=miRNA_subtypes_col)),
left_annotation=rowAnnotation(subtype_miRNA=mirna_subtypes$Subtype_miRNA,
col = list(subtype_miRNA=miRNA_subtypes_col))
)
The automatically generated colors map from the 1^st and 99^th of the values
in the matrix. There are outliers in the matrix whose patterns might be hidden
by this color mapping. You can manually set the color to `col` argument.
Use `suppressMessages()` to turn off this message.
draw(h, merge_legend=TRUE)
rownames(met_sim) <- sample_order
colnames(met_sim) <- sample_order
# remove missing samples from the matrix for the visualization
avail_samples <- colnames(met_sim)[rowSums(is.na(met_sim)) != dim(met_sim)[1]]
Heatmap(met_sim[avail_samples, avail_samples], show_row_names=FALSE, show_column_names=FALSE,
name="DNA methylation SUMO similarity")
Here is how you can run SUMO to integrate and factorize created similarity matrices if the number of clusters is known.
sumo run aml/prepared.aml.npz 7 aml/sumo_k7
npz <- np$load(file.path(data_dir_path,'sumo_k7','k7','sumo_results.npz'), allow_pickle=TRUE)
npz$files
[1] "pac" "cophenet" "clusters"
[4] "consensus" "unfiltered_consensus" "quality"
[7] "samples" "config" "steps"
We can find two consensus matrices in this file: the unfiltered version, and ‘consensus’, where all values lower than 0.5 were set to 0. SUMO uses the filtered version to find the final sample labels. However, when making comparisons between different SUMO clusterings (see the next section of this tutorial), the unfiltered version should always be used.
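The filtering rule, as described above, can be sketched as follows (our reading of the description, not SUMO’s code):

```r
# Zero out consensus values below the threshold, as described above.
filter_consensus <- function(con, threshold = 0.5) {
  con[con < threshold] <- 0
  con
}

con_toy <- matrix(c(1.0, 0.7, 0.2,
                    0.7, 1.0, 0.4,
                    0.2, 0.4, 1.0), nrow = 3)
con_filt <- filter_consensus(con_toy)
```

Only sample pairs that co-cluster in at least half of the runs keep a non-zero entry, which sharpens the block structure used for the final label assignment.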
sample_order <- npz$f[['samples']]
con <- npz$f[['consensus']]
rownames(con) <- sample_order
colnames(con) <- sample_order
con_subtypes <- tibble(sample_codes=sample_order) %>%
left_join(subtypes_data, by = "sample_codes") # in order of sample_order
h <- Heatmap(con, show_row_names=FALSE, show_column_names=FALSE,
name="SUMO k=7 consensus matrix",
top_annotation=HeatmapAnnotation(subtype_mRNA=con_subtypes$Subtype_Selected,
col = list(subtype_mRNA=exp_subtypes_col)),
left_annotation=rowAnnotation(subtype_miRNA=con_subtypes$Subtype_miRNA,
col = list(subtype_miRNA=miRNA_subtypes_col))
)
draw(h, annotation_legend_side="bottom")
The clusters are pretty well established (i.e. robust to the subsampling). We can observe the interplay between miRNA and gene expression in driving the cluster separation. The cluster supported by both data types (AML.3/miRNA#5, which we observed previously) is clearly separated. We can also observe the separation of a cluster of samples driven fully by miRNA (#2). A noteworthy number of samples from miRNA cluster #4 separates into 2 clusters. As the separation of these two groups does not seem to be driven by gene expression, it is possible that it is driven by methylation. Finally, even though the most commonly used TCGA classification is based on gene expression with k=7, running SUMO with the same number of clusters on the integrated dataset yields clusters that not only do not recapitulate those subtypes perfectly, but also seem to be significantly driven by the miRNA subtypes.
The estimation of the number of sample clusters “k” in the dataset (i.e. the “factorization rank”) is a challenging problem. Again, it is recommended to run the method for a broad range of possible “k” values and compare the robustness of the resulting clusters. For this purpose SUMO implements two metrics: - Cophenetic Correlation Coefficient (CCC) - a common metric for NMF models, which measures the Pearson correlation between the sample distances and their hierarchical clustering (we look for a high CCC value, typically > 0.95); - Proportion of Ambiguously Clustered pairs (PAC) - this metric makes use of SUMO’s subsampling approach by counting the proportion of values in the consensus matrix that fall in the (0.1, 0.9) range (we look for a low PAC value, ideally < 0.1).
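Both metrics are easy to sketch in base R from a consensus matrix (our reading of the definitions above, not SUMO’s exact implementation):

```r
# PAC: proportion of sample pairs with ambiguous consensus values.
pac <- function(con, lower = 0.1, upper = 0.9) {
  vals <- con[upper.tri(con)]              # each sample pair counted once
  mean(vals > lower & vals < upper)
}
# CCC: correlation between consensus-derived distances and the cophenetic
# distances of their hierarchical clustering.
ccc <- function(con) {
  d <- as.dist(1 - con)                    # consensus similarity -> distance
  hc <- hclust(d, method = "average")
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

set.seed(7)
grp <- rep(1:2, each = 10)                 # toy consensus with 2 clean clusters
con_toy <- 0.05 + 0.9 * outer(grp, grp, "==") + matrix(runif(400, 0, 0.05), 20, 20)
con_toy <- (con_toy + t(con_toy)) / 2      # symmetrize
diag(con_toy) <- 1
c(pac = pac(con_toy), ccc = ccc(con_toy))  # clean structure: low PAC, high CCC
</imports>
```

On this clean toy matrix PAC is essentially zero and CCC is close to 1; ambiguous co-clustering pushes PAC up and CCC down.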
Here is how we can run SUMO for a range of “k” values. Notice the use of the “-t” argument, which allows specifying the number of cores for parallel computing. This will most likely take a couple of minutes.
sumo run -t 4 aml/prepared.aml.npz 2,10 aml/sumo
# let's plot the clustering metrics for all "k"
np <- import("numpy")
pac <- sapply(2:10, function(x){
npz <- np$load(file.path(data_dir_path,'sumo', paste0('k', x),'sumo_results.npz'), allow_pickle=TRUE)
return(npz$f[['pac']])
})
ccc <- sapply(2:10, function(x){
npz <- np$load(file.path(data_dir_path,'sumo', paste0('k', x),'sumo_results.npz'), allow_pickle=TRUE)
return(npz$f[['cophenet']])
})
colnames(pac) <- paste0(2:10)
pac <- as_tibble(pac) %>% gather(k, 'pac')
colnames(ccc) <- paste0(2:10)
as_tibble(ccc) %>%
gather(k, 'ccc') %>%
full_join(pac, by = "k") %>%
gather(metric, value, -k) %>%
group_by(k, metric) %>%
summarise(med=median(value), min=min(value), max=max(value)) %>%
mutate(k=as.numeric(k)) %>%
ggplot() +
geom_line(aes(x=k, y=med, color=metric, group=metric), size=1) +
geom_point(aes(x=k, y=med, color=metric), size=2) +
geom_ribbon(aes(x=k, ymin=min, ymax=max, fill=metric), alpha=0.2) +
facet_wrap(metric~., scales="free", ncol=1) +
theme(legend.position="none")
It seems that both 7 and 8 are viable results. Let’s compare these results:
read_tsv(file.path(data_dir_path,'sumo', paste0('k', 7),'clusters.tsv'), show_col_types = FALSE) %>%
rename(sumo_label_k7=label) %>%
full_join(read_tsv(file.path(data_dir_path,'sumo', paste0('k', 8),'clusters.tsv'),
show_col_types = FALSE), by = "sample") %>%
rename(sumo_label_k8=label) %>%
mutate(sumo_label_k7=as.factor(sumo_label_k7), sumo_label_k8=as.factor(sumo_label_k8)) %>%
group_by(sumo_label_k7, sumo_label_k8) %>%
summarise(nsamples=n()) %>%
ggplot(aes(y=nsamples, axis1=sumo_label_k7, axis2=sumo_label_k8)) +
geom_alluvium(aes(fill=sumo_label_k8), width = 0, knot.pos = 0, reverse = FALSE) +
guides(fill = "none") +
geom_stratum(width = 1/12, reverse = FALSE) +
geom_text(stat = "stratum", aes(label = after_stat(stratum)),
reverse = FALSE) +
scale_x_continuous(breaks = 1:2, labels = c("SUMO k=7","SUMO k=8")) +
labs(title="SUMO", y="#samples")
read_tsv(file.path(data_dir_path,'sumo', paste0('k', 7),'clusters.tsv'), show_col_types = FALSE) %>%
rename(sumo_label_k7=label) %>%
group_by(sumo_label_k7) %>%
summarise(nsample=n())
read_tsv(file.path(data_dir_path,'sumo', paste0('k', 8),'clusters.tsv'), show_col_types = FALSE) %>%
rename(sumo_label_k8=label) %>%
group_by(sumo_label_k8) %>%
summarise(nsample=n())
This is a good result: the wide “flow lines” above show that most samples group into the same clusters for both k=7 and k=8. However, one sample grouping into a separate cluster for k=7 suggests a slightly less stable result. Due to this, we are going to proceed with k=8.
In this section we are going to look into a couple of examples of downstream analysis of multi-omic data integration and molecular subtyping.
# read the survival data
surv_data <- read_tsv(file.path(data_dir_path, "survival"), show_col_types = FALSE)
surv_data <- surv_data %>% #unify sample ids
separate(PatientID, c('tcga', 'tss', 'participant', 'st'), sep="-") %>%
unite(c(tcga, tss, participant), col="samples", sep="-") %>%
select(-st)
surv_data <- subtypes_data %>%
select(samples, Subtype_miRNA, Subtype_Selected) %>%
left_join(surv_data, by = "samples") %>%
distinct()
head(surv_data)
dim(surv_data)
First, let’s see if the groups of samples separated by the TCGA subtypes are different in terms of survival.
ggsurvplot(survfit(Surv(Survival, Death) ~ Subtype_miRNA, data=surv_data),
data=surv_data,
palette = "npg",
pval = TRUE,
ggtheme = theme_bw(),
risk.table = TRUE) +
guides(colour = guide_legend(nrow = 2))
ggsurvplot(survfit(Surv(Survival, Death) ~ Subtype_Selected, data=surv_data),
data=surv_data,
palette = "npg",
pval = TRUE,
ggtheme = theme_bw(),
risk.table = TRUE) +
guides(colour = guide_legend(ncol = 2))
Now, let’s investigate our recent SUMO clustering result:
sumo_labels <- read_tsv(file.path(data_dir_path, "sumo", 'k8', 'clusters.tsv'), show_col_types = FALSE) %>%
separate(sample, c('tcga', 'tss', 'participant', 'st'), sep="\\.") %>%
unite(c(tcga, tss, participant), col="samples", sep="-") %>%
select(-st) %>%
left_join(surv_data, by = "samples")
dim(sumo_labels)
head(sumo_labels)
ggsurvplot(survfit(Surv(Survival, Death) ~ label, data=sumo_labels),
data=sumo_labels,
palette = "npg",
pval = TRUE,
ggtheme = theme_bw(),
risk.table = TRUE) +
guides(colour = guide_legend(nrow = 2))
The p-value of the log-rank test reported for both the TCGA subtypes and the SUMO clusters is highly significant, which means that in each case there is at least one group of samples with significantly different survival. For the TCGA subtypes it is the previously observed miRNA#5/AML.3 subtype. As shown below, almost all of those samples are grouped into the same cluster by SUMO.
sumo_labels %>%
filter(Subtype_miRNA==5 & Subtype_Selected=="AML.3") %>%
group_by(label) %>%
summarise(n_samples=n())
It is important to note that biologically different subtypes (for example, samples with very different pathway activity) may not always show differences in survival. Additionally, this metric may be biased due to differences in patient treatment history.
Identification of features correlating with the cluster separation is a frequent goal of multi-omics data integration. These features can help to understand the source of the differences between the groups of samples and facilitate the discovery of novel biomarkers of the disease.
The most straightforward way to interpret the latent factors in a biological context is to look at the coefficients associated with them. For latent variable models that take the linear form X ≈ WH, we want to investigate the factor matrix W, which contains coefficients tying each latent variable to each of the features in the original multi-omics data matrices. By inspecting these coefficients, we can get a sense of which multi-omics features are co-regulated.
# create an annotation dataframe for the heatmap for each feature, indicating its omics-type
data_anno <- tibble(
omics=c(rep('gene expression',dim(data_exp)[1]),
rep('DNA methylation',dim(data_met)[1]),
rep('miRNA expression',dim(data_mirna)[1])))
# generate the heat map
omics_col = setNames(RColorBrewer::brewer.pal(name = "Set2", n = 3), unique(data_anno$omics))
Heatmap(mfa.w, show_row_names=FALSE,
name="MFA feature coefficients",
left_annotation=rowAnnotation(omic=data_anno$omics, col = list(omic=omics_col)))
Closer inspection of this heatmap can reveal which feature values are predominantly associated with the different factors of our latent space. Ideally, we would like this matrix to have a “disentangled” property, where each feature is predominantly associated with only a single factor. However, depending on the model and the dataset used, this is not always possible. More sophisticated models frequently apply an additional feature selection step, which allows selecting only the “top” features that contribute most to the factor separation.
In order to investigate the oncogenic processes that drive the differences between tumors, we may draw upon biological prior knowledge by looking for overlaps between genes that drive certain tumors, and genes involved in familiar biological processes.
We can take advantage of publicly available resources such as Gene Ontology Consortium’s GO terms, the Reactome pathways database, and the Kyoto Encyclopaedia of Genes and Genomes to find lists of so-called gene sets, or pathways, which are sets of genes which are known to operate together in some biological function, e.g. protein synthesis, DNA mismatch repair, cellular adhesion, and many other functions. Gene set enrichment analysis is a method which looks for overlaps between genes which we have found to be of interest, e.g. by them being implicated in a certain tumor type, and the a-priori gene sets discussed above.
In the context of making sense of latent factors, the question we will be asking is whether the genes which drive the value of a latent factor (the genes with the highest factor coefficients) also belong to any interesting annotated gene sets, and whether the overlap is greater than we would expect by chance. If there are “N” genes in total, “K” of which belong to a gene set, the probability that “k” out of the “n” genes associated with a latent factor also belong to the gene set is given by the hypergeometric distribution:
P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n),
where C(a, b) denotes the binomial coefficient “a choose b”.
The hypergeometric test uses the hypergeometric distribution to assess the statistical significance of the presence of gene-set genes in the latent factor. The null hypothesis is that there is no relationship between the genes in a gene set and the genes in a latent factor. When testing for over-representation of gene-set genes in a latent factor, the “P” value from the hypergeometric test is the probability of getting “k” or more genes from a gene set in a latent factor:
P(X ≥ k) = Σ_{i = k}^{min(n, K)} C(K, i) · C(N − K, n − i) / C(N, n).
The hypergeometric enrichment test is also referred to as Fisher’s one-sided exact test. This way, we can determine if the genes associated with a factor significantly overlap (beyond chance) the genes involved in a biological process. Because we will typically be testing many gene sets, we will also need to apply multiple testing correction, such as Benjamini-Hochberg correction.
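In R, the tail probability above can be computed directly with phyper, and it matches a one-sided fisher.test on the 2×2 overlap table (the counts below are toy numbers for illustration):

```r
# N genes in total, K of them in the gene set, n genes associated with the
# latent factor, k genes in the overlap (toy numbers)
N <- 20000; K <- 150; n <- 300; k <- 12

# P(X >= k): upper tail of the hypergeometric distribution
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same one-sided test via Fisher's exact test on the 2x2 overlap table:
# rows = in/out of the factor, columns = in/out of the gene set
tab <- matrix(c(k, K - k, n - k, N - K - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Here the expected overlap under the null is n·K/N = 2.25 genes, so observing 12 is highly significant; remember that testing many gene sets requires multiple testing correction as noted above.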
In the example below, we look for the genes associated preferentially with the first latent factor and use the enrichR package to query the Gene Ontology terms which might be overlapping:
# select genes associated preferentially with the first latent factor
genes.factor.1 <- rownames(data_exp)[max.col(mfa.w[1:dim(data_exp)[1],]) == 1]
# call the enrichr function to find gene sets enriched in the first latent factor
# in the GO Biological Processes 2018 library
go.factor.1 <- enrichR::enrichr(genes.factor.1,
databases = c("GO_Biological_Process_2018")
)$GO_Biological_Process_2018
Since SUMO belongs to the class of multi-omic data integration models that use similarity matrices rather than feature matrices, the ability to directly interpret feature contributions to the sample separation by examining the factorized matrices is limited. Thus, the SUMO package uses a separate gradient boosting classifier implemented in LightGBM - a tree-based model - to assess the importance of each feature for each cluster.
If you want to learn more about LightGBM, check this article.
Here is how you can run SUMO in interpret mode to find features supporting the cluster separation. This step is computationally expensive, as it performs hyperparameter optimization using a random search with 5-fold cross-validation (performed on 80% of the features) to avoid overfitting. Notice the use of the “-t” argument, which allows specifying the number of cores for parallel computing.
Let’s investigate the “top” features of importance for each cluster. Here we supply the label of the cluster with a different prognosis compared to the other groups of samples; in this run it was label 6.
data <- read_tsv(file.path(data_dir_path, 'interpret_k8.tsv'), show_col_types = FALSE) %>%
gather(group, importance, -feature)
labels <- read_tsv(file.path(data_dir_path, 'sumo', 'k8', 'clusters.tsv'), show_col_types = FALSE)
selected_label <- 6
top_features <- data %>%
filter(group == paste("GROUP", selected_label, sep="_")) %>%
arrange(desc(importance)) %>% top_n(6, importance)
Exercise: Extract the values of the top features from the pre-processed feature matrices and confirm whether each feature of interest has significantly different values for the cluster of samples of interest. (Hint: for comparisons of feature values from continuous data you can use the Kruskal-Wallis test.)
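As a reminder of the relevant R interface, here is a minimal kruskal.test call on synthetic data (an illustration of the API, not the exercise solution):

```r
# Kruskal-Wallis test: does a feature's value differ across sample clusters?
set.seed(3)
cluster <- factor(rep(1:3, each = 20))        # toy cluster labels, 3 groups of 20
feature <- rnorm(60) + c(0, 0, 2)[cluster]    # cluster 3 is shifted upwards
res <- kruskal.test(feature ~ cluster)
res$p.value
```

A small p-value indicates that at least one cluster has a different distribution of the feature; being rank-based, the test makes no normality assumption.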
Think about the aspect of multi-omic integration that is most interesting for you and how you can use the gained knowledge.
Here are a couple of ideas worth exploring:
How does the imputation of the data change the results? Remove some values from the dataset and compare different types of data imputation against the full data using SUMO.
The integration of non-continuous data types. As mentioned above, iCluster supports both somatic mutation and copy number data. The SUMO documentation includes a vignette showing how somatic mutations can be converted into a continuous dataset and integrated. How does the integration of somatic mutations change the sample classification? (Hint: you can download the somatic mutation data from UCSC Xena.)
cBioPortal is a great resource for cancer genomic analysis. Use cBioPortal to find out which genes are most frequently mutated in your dataset. Download the somatic mutation data and investigate whether some of the identified groups of samples are especially enriched for somatic mutations in those genes. (Hint: you can download the somatic mutation data from UCSC Xena.)